Letter to the Editor


Severe Acute Respiratory Syndrome Coronavirus Sequence Characteristics and 
Evolutionary Rate Estimate from Maximum Likelihood Analysis 


In November 2002, a previously unknown severe acute
respiratory syndrome (SARS) was observed in patients in
Guangdong Province, China (7). In March 2003, a new
coronavirus (SARS-CoV) was associated with the SARS outbreak
(2), and several full-genome sequences of SARS-CoV were
obtained and compared (4). The family Coronaviridae
comprises large, single-stranded, positive-sense RNA viruses
isolated from several species and previously known to cause
common colds and diarrheal illnesses in humans (3). The
emergence of such
a novel, highly virulent pathogen warrants rapid investigation 
of its etiology and evolution to effectively control its impact on 
human health. In particular, estimating the rate of evolution of 
SARS-CoV would give an indication of how quickly the virus 
can potentially increase its genetic variability, which in turn has 
important implications for disease progression and drug and 
vaccine development. 

Phylogenetic analysis has proven successful for the investi- 
gation and prediction of the evolution of viruses such as influ- 
enza virus (1). Initial inspection of SARS-CoV sequences re- 
vealed a high degree of homogeneity, which might indicate an 
RNA virus that evolves unusually slowly. To investigate fur- 
ther, we carried out a full-genome alignment of the available 
SARS-CoV strains recently analyzed by Ruan et al. (4) by use 
of the CLUSTAL algorithm (6). The alignment was carefully 
edited by hand to maximize the number of identities, and the 
site positions containing gaps were removed. The resulting 
alignment (available from the authors upon request) is 21,333 
nucleotides long; 63 sites have at least one sequence with a 
different nucleotide, and only 10 sites are phylogenetically in- 
formative, i.e., they are useful to discriminate among different 
tree topologies, according to the unweighted parsimony crite- 
rion. Subalignments were generated for all of the known cod- 
ing regions, most of which were identical among the different 
isolates. We analyzed open reading frame (ORF) 1ab (4),
which appears to be the most variable. Maximum likelihood 
(ML) methods were employed for the analyses because they 
allow for the testing of different phylogenetic hypotheses by 
calculating the probability of a given model of evolution gen- 
erating the observed data and by comparing the probabilities 
of nested models by the likelihood ratio test (5). In addition, 
because only 10 sequences were retained after excluding the 
identical ones, it was possible to search for the optimal ML tree 
through an exhaustive or branch-and-bound search (5). 
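The site counts quoted above (variable sites versus phylogenetically informative sites) follow from a simple column-by-column rule. The sketch below is a minimal illustration of that rule under the unweighted parsimony criterion, assuming a gap-free alignment held as equal-length strings. The function name `classify_sites` and the toy alignment are hypothetical and are not the authors' data or software.

```python
from collections import Counter


def classify_sites(alignment):
    """Count variable and parsimony-informative columns in a gap-free alignment.

    A column is variable if it shows more than one nucleotide. It is
    parsimony-informative (unweighted parsimony) if at least two different
    nucleotides each occur in at least two sequences.
    """
    lengths = {len(seq) for seq in alignment}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")

    variable = 0
    informative = 0
    for column in zip(*alignment):
        counts = Counter(column)
        if len(counts) > 1:
            variable += 1
        # Only states seen in two or more sequences can support a grouping.
        shared_states = [base for base, n in counts.items() if n >= 2]
        if len(shared_states) >= 2:
            informative += 1
    return variable, informative


# Toy alignment (hypothetical data, not the SARS-CoV sequences).
toy_alignment = [
    "ACGTAC",
    "ACGTAT",
    "ACATGC",
    "ACATGC",
]
variable_sites, informative_sites = classify_sites(toy_alignment)
print(variable_sites, informative_sites)
```

In the toy alignment the last column differs in a single sequence only, so it is variable but not informative. This is the same distinction that separates the 63 variable sites from the 10 informative ones in the alignment described above.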

Table 1 shows the average base composition and the ML 
estimates of parameters describing the mode of evolution of 
SARS-CoV in ORF 1ab. The α parameter of the Γ distribution
is extremely low (0.008), implying an extensive heterogeneity in 
the rate at which different nucleotide sites mutate along the 
genome. Moreover, the ML estimator implies that about 90% 
of the constant sites in the sequences are indeed invariable, i.e., 
they never change, possibly because of strong purifying selec- 
tion. The variable sites, on the other hand, accumulate muta- 
tions very quickly. However, a note of caution is necessary 
because such a result may also be due to the small number of 
sequences available for analysis and the very short observation 
period. Table 1 also shows that the hypothesis of a molecular 
clock cannot be rejected, although the P value is very close to
0.05; i.e., SARS-CoV isolates appear to be evolving at a
constant evolutionary rate, which can be estimated from the ML
tree with clock-like branch lengths shown in Fig. 1. The branch 
lengths in the tree are proportional to the number of mutations 
accumulated by each viral lineage during evolution from the 
cenancestor, the most recent common ancestor. Assuming that 
the SARS-CoV cenancestor entered the human population 4 
to 8 months ago (7), the evolutionary rate of the virus is of the 
order of 4 × 10⁻⁴ nucleotide changes per site per year (95%
confidence interval [CI], 2.0 × 10⁻⁴ to 6 × 10⁻⁴) along the
entire ORF 1ab. When only the variable sites are considered,
the estimated rate is noticeably higher: 3.5 × 10⁻³ changes per
site per year (95% CI, 2.6 × 10⁻³ to 4.4 × 10⁻³). This is the
usual range for an RNA virus. Therefore, on average, eight
point mutations are expected for the entire ORF 1ab region at
each replication round. However, we cannot exclude the pos- 
sibility that the sequence variability in the data sets is also 
affected by the passage of the virus in Vero cell culture before 
sequencing (4). Figure 1 also shows that the root of the tree, 
inferred by ML, is between the strains isolated from Hong 
Kong and Beijing, which are known to be epidemiologically 
linked to the strains isolated from patients in Guangdong Prov- 
ince and all the others (4). Epidemiological data also indicate 
that the index patient traveled from Guangdong to Hotel M in 
Hong Kong, where he transmitted the virus to several
individuals who subsequently traveled to Singapore, Canada, and
Vietnam (4). The tree shows, indeed, that the Singapore isolate and
the isolates from Beijing belong to different, statistically sup- 
ported clusters. However, because of the low phylogenetic 
signal, further classification of SARS-CoV isolates is not
possible by phylogenetic analysis. All analyses confirm that
SARS-CoV is not closely related to any known coronavirus (4),
although it is assumed that the source must be one or more
unidentified animal reservoirs in Asia.


TABLE 1. Maximum likelihood estimators of nucleotide
substitution model parameters for the SARS virus in ORF
1ab polyprotein^a

                   ML estimate^b                  Inferred evolutionary rate
         ------------------------------------    ---------------------------
         Transition/                              All sites    Variable sites
Tree     transversion     α         Pinv          (10⁻⁴)       (10⁻³)
         rate ratio
------   ------------     ------    -----         ---------    --------------
ML^c     1.68             0.008     0.910
MLK^d    1.88             0.0024    0.885         4.0 ± 2.0    3.5 ± 0.9

^a Percent average base composition was as follows: A, 28.4; C, 19.5;
T, 21.3; G, 30.8.

^b The best-fitting nucleotide substitution model (HKY85+Γ+I) was selected
with a hierarchical likelihood ratio test procedure by using a suboptimal
tree (5). The model assumes unequal transition and transversion
substitution rates, different categories of sites along the genome
changing at different rates (described by the α parameter of a Γ
distribution of rates), and a class of invariable sites (described by the
Pinv parameter).

^c Estimates are based on the maximum likelihood tree.

^d The likelihood for each possible rooted tree was obtained, and the best
tree, according to the Shimodaira-Hasegawa test, was selected as the
maximum likelihood clock tree. The clock hypothesis could not be rejected
by the likelihood ratio test (P = 0.052).


FIG. 1. Optimal ML tree of SARS-CoV ORF 1ab nucleotide sequences. Branch
lengths are drawn proportional to the number of nucleotide changes per
site and were estimated via ML enforcing a molecular clock and employing
the HKY85+Γ+I nucleotide substitution model (Table 1). The numbers on the
branches represent the percentages of bootstrap-jackknife support (1,000
replicates) for the subtending clade. The P value for the
zero-branch-length test (7) is also given.
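The molecular clock test reported in Table 1 compares two nested models with a likelihood ratio test. The sketch below shows how such a test can be computed. It assumes the standard result that, for n sequences, the clock-constrained tree has n - 2 fewer free branch-length parameters than the unconstrained tree, which gives 8 degrees of freedom for the 10 sequences analyzed here. The letter does not report the underlying log-likelihoods, so the values used are hypothetical, and the function names are illustrative only. The chi-square tail uses a closed form that is valid only for even degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable for even df.

    For df = 2k the tail has the closed form
    exp(-x/2) * sum_{i<k} (x/2)^i / i!, so no external library is needed.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def clock_lrt(lnl_no_clock, lnl_clock, n_taxa):
    """Likelihood ratio test of the molecular clock.

    Returns (statistic, degrees of freedom, P value). The clock tree is
    treated as nested in the unconstrained tree with n_taxa - 2 fewer
    free parameters (an assumption; the letter does not state the df).
    """
    statistic = 2.0 * (lnl_no_clock - lnl_clock)
    df = n_taxa - 2
    return statistic, df, chi2_sf_even_df(statistic, df)


# Hypothetical log-likelihoods for illustration; not values from the letter.
stat, df, p_value = clock_lrt(lnl_no_clock=-30000.0, lnl_clock=-30007.0, n_taxa=10)
print(f"2*delta lnL = {stat:.2f}, df = {df}, P = {p_value:.3f}")
if p_value > 0.05:
    print("clock not rejected at the 5% level")
```

A P value above 0.05, such as the 0.052 given in Table 1, means the clock-constrained tree does not fit significantly worse than the unconstrained one, which is what licenses converting branch lengths into a rate per year.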

In conclusion, the low sequence variability of SARS-CoV 
isolates is probably the consequence of its recent emergence in 
humans, but much greater viral heterogeneity with unpredict- 
able consequences may be expected if the epidemic is not 
controlled. A rigorous phylogenetic approach might be an im- 
portant tool to monitor the future evolution of the virus. 